feat(python): expose `create` to DeltaTable class #1912
Conversation
This is super exciting!! What's the full list of table features this syntax should support / should be able to support in the future?
This is great!
@MrPowers I think those two at least. For the constraints we could probably do something like this: `check_constraints: Dict[str, str]`. For generated columns I need to think about what's ideal there; it also depends a bit on the implementation on the Rust side. Maybe: `{'col1': {'dtype': 'str', 'expr': 'concat(col2, col3)'}}`
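A minimal sketch of the shapes proposed above (the constraint name and expression below are made-up examples for illustration, not part of any existing API):

```python
# Illustrative only: these parameter shapes follow the proposal in the comment
# above and are not a committed API.
check_constraints: dict[str, str] = {
    "col2_not_empty": "length(col2) > 0",  # hypothetical constraint name and expression
}

generated_columns: dict[str, dict[str, str]] = {
    "col1": {"dtype": "str", "expr": "concat(col2, col3)"},
}
```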
@ion-elgreco - cool, awesome, just wanted to make sure we're doing what we can to make this interface future-proof!
@MrPowers perhaps we can expand the Delta Schema class to define generated columns, then they can be passed together with the rest of the schema. |
Force-pushed from 7499a90 to a358aa3
Force-pushed from a358aa3 to 18156fb
Looks good to me (after the docstrings are added :))
- Adds the Rust writer as an additional engine in Python
- Adds overwrite schema functionality to the Rust writer

@roeap feel free to point out improvements 😄

A couple of gaps will exist between the current Rust writer and the pyarrow writer; we will have to solve these in a later PR:

- ReplaceWhere (partition filter / predicate) overwrite (users can work around this by doing `DeltaTable.delete` and then an append)

- closes delta-io#1861

---------

Signed-off-by: Nikolay Ulmasov <[email protected]>
Co-authored-by: Robert Pack <[email protected]>
Co-authored-by: Robert Pack <[email protected]>
Co-authored-by: David Blajda <[email protected]>
Co-authored-by: Nikolay Ulmasov <[email protected]>
Co-authored-by: Matthew Powers <[email protected]>
Co-authored-by: Thomas Frederik Hoeck <[email protected]>
Co-authored-by: Adrian Ehrsam <[email protected]>
Co-authored-by: Will Jones <[email protected]>
Co-authored-by: Marijn Valk <[email protected]>
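For context, a rough sketch of the workflow described above, assuming the `engine` keyword and `DeltaTable.delete` behave as the commit message describes (paths and data are placeholders):

```python
import pyarrow as pa
from deltalake import DeltaTable, write_deltalake

data = pa.table({"id": [1, 2, 3], "country": ["NL", "NL", "US"]})

# Write with the Rust writer instead of the default pyarrow engine.
write_deltalake("path/to/table", data, engine="rust")

# ReplaceWhere is not supported by the Rust writer yet; the workaround from the
# commit message is to delete the matching rows and then append the new data.
new_nl_rows = pa.table({"id": [4, 5], "country": ["NL", "NL"]})
dt = DeltaTable("path/to/table")
dt.delete("country = 'NL'")
write_deltalake("path/to/table", new_nl_rows, mode="append", engine="rust")
```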
# Description

The current implementation of `ObjectOutputStream` does not invoke flush when writing out files to Azure storage, which seems to cause intermittent issues where `write_deltalake` hangs with no progress and no error. I'm adding a periodic flush to the write process, based on the written buffer size, which can be parameterized via the `storage_options` parameter (I could not find another way without changing the interface). I don't know if this is an acceptable approach (also, it requires string values). Setting `"max_buffer_size": f"{100 * 1024}"` in the `storage_options` passed to `write_deltalake` helps me resolve the issue with writing a dataset to Azure which was otherwise failing constantly. The default max buffer size is set to 4 MB, which looks reasonable and is used by other implementations I've seen (e.g. https://github.com/fsspec/filesystem_spec/blob/3c247f56d4a4b22fc9ffec9ad4882a76ee47237d/fsspec/spec.py#L1577).

# Related Issue(s)

Can help with resolving delta-io#1770

# Documentation

If the approach is accepted then I need to find the best way of adding this to the docs

---------

Signed-off-by: Nikolay Ulmasov <[email protected]>
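As a usage example of the knob introduced here (the Azure account values are placeholders; the `max_buffer_size` key is the one added by this change and may differ in other releases):

```python
import pyarrow as pa
from deltalake import write_deltalake

data = pa.table({"id": [1, 2, 3]})

write_deltalake(
    "abfss://container@account.dfs.core.windows.net/my_table",  # placeholder URI
    data,
    storage_options={
        "account_name": "account",
        "account_key": "<key>",
        # Flush roughly every 100 KiB instead of the 4 MB default; the value
        # must be passed as a string.
        "max_buffer_size": f"{100 * 1024}",
    },
)
```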
# Description

Saves the user from ending up with a failed `load` function call and a newly created folder by failing fast when the user tries to load a path that doesn't exist.

# Related Issue(s)

- closes delta-io#1916
# Description

A second attempt to extend `write_deltalake` to accept either a PyArrow or a Deltalake schema (I messed up the previous PR with some rebase issues). Added a test.

# Related Issue(s)

closes delta-io#1862

---------

Signed-off-by: Nikolay Ulmasov <[email protected]>
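A sketch of what this change allows, assuming `write_deltalake` takes a `schema` keyword and the deltalake `Schema`/`Field`/`PrimitiveType` classes are constructed as shown (exact constructor details may differ across releases):

```python
import pyarrow as pa
from deltalake import write_deltalake
from deltalake.schema import Field, PrimitiveType, Schema

data = pa.table({"id": [1, 2, 3]})

# Passing a PyArrow schema, as before.
write_deltalake("path/to/table_a", data, schema=pa.schema([("id", pa.int64())]))

# Passing a Deltalake schema, which this change adds support for.
delta_schema = Schema([Field("id", PrimitiveType("long"), nullable=True)])
write_deltalake("path/to/table_b", data, schema=delta_schema)
```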
# Description

Adds a documentation page on the Delta Lake Arrow integration.
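For reference, the kind of usage such a page covers, reading a Delta table through Arrow (the path is a placeholder):

```python
from deltalake import DeltaTable

dt = DeltaTable("path/to/table")
arrow_table = dt.to_pyarrow_table()      # materialize the table as a pyarrow.Table
arrow_dataset = dt.to_pyarrow_dataset()  # expose it lazily as a pyarrow dataset
```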
According to the issue, the test should fail to load a table without a snapshot (version 0), but the test is written to verify that it is possible to read and load a Delta table at version 0 in Rust (the `open_table` and `open_table_with_version` functions work).
…ings (delta-io#1895)

The Delta protocol specifies two possible formats for timestamp partitions:

- {year}-{month}-{day} {hour}:{minute}:{second}
- {year}-{month}-{day} {hour}:{minute}:{second}.{microsecond}

However, string comparison of the partition filter value and the partition values was performed, which rendered timestamps like 2020-12-31 23:59:59.000000 and 2020-12-31 23:59:59 as different. This change uses timestamp comparison instead of string comparison.

Co-authored-by: Igor Borodin <[email protected]>
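A small illustration of the bug being fixed: the two protocol-allowed renderings of the same instant compare unequal as strings but equal as timestamps.

```python
from datetime import datetime

a = "2020-12-31 23:59:59.000000"
b = "2020-12-31 23:59:59"

print(a == b)  # False: string comparison treats them as different partition values

ts_a = datetime.strptime(a, "%Y-%m-%d %H:%M:%S.%f")
ts_b = datetime.strptime(b, "%Y-%m-%d %H:%M:%S")
print(ts_a == ts_b)  # True: comparing as timestamps matches, as the fix intends
```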
Created a new PR here: #1932, since I keep messing up the rebase for some reason.
Description
Allows one to create a Delta table without writing any data to it.
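A usage sketch of the new method (keyword names other than the table URI and `schema` are assumptions based on the discussion above, not a confirmed signature):

```python
import pyarrow as pa
from deltalake import DeltaTable

schema = pa.schema([("id", pa.int64()), ("country", pa.string())])

dt = DeltaTable.create(
    "path/to/table",           # placeholder table URI
    schema=schema,
    mode="error",              # assumed default: fail if a table already exists here
    partition_by=["country"],
)
print(dt.version())  # 0: the table exists even though no data has been written
```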
Related Issue(s)